Analyses of Baby Name Popularity Distribution in the U.S. for the Last 131 Years

Wentian Li 

The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institute for Medical Research,
North Shore LIJ Health System, 350 Community Drive, Manhasset, NY 11030, USA.

We examine the complete dataset of baby name popularity collected by the U.S. Social Security Administration
for the last 131 years (1880-2010). The ranked baby name popularity can be fitted empirically by a piecewise
function consisting of a Beta function for the high-ranking names and a power-law function for the low-ranking
names, but not by a power-law (Zipf's law) or a Beta function alone.

1 Introduction 

Zipf's law describes a class of ranked distributions in which the ranked quantity $y$ falls off
with rank $r$ as $y_r = C/r^a$ ($a \approx 1$). This "law" was originally observed in word usage in
human languages (Zipf, 1935), where $y_r$ is the number of times (number of "tokens") the rank-$r$
word (a "type") appears in a language text. But Zipf's law also fits many other datasets, such
as city population (Brakman et al., 2009), company size (Saichev et al., 2009), internet traffic
(Rowlands, 2000), and many others (…, 2002).



When a ranked quantity deviates from a perfect Zipf's law, it is often dismissed as a "finite
size effect", i.e., fluctuation in the ranked quantity for low-ranking entities caused by small
sample size. The assumption is that the fluctuation will be reduced in a larger sample, and
Zipf's law would be preserved. Unsatisfied with this explanation, Cocho and his colleagues
proposed a new class of rank functions, similar in form to the Beta probability distribution,
that curve in a log(rank)-log(quantity) plot (Mansilla et al., 2007; Naumis and Cocho, 2008;
Martínez-Mekler et al., 2009). This class of Beta rank functions has proven to be an
excellent fitting function for many datasets, ranging from the number of collaborators in a social
network and the number of citations per article and per journal to musical scores and
alphabet frequencies (Mansilla et al., 2007; Campanario, 2010; Li et al., 2010;
Alvarez-Martinez et al., 2011; Li and Miramontes, 2011; Petersen et al., 2011; Li, 2012a).



The success of the Beta rank function may lead to a belief that we have a function
universal enough to fit all ranked datasets. Indeed, there have been works to prove a universal
mechanism behind the Beta rank function (Beltran et al., 2011), or to relate the Beta rank
function to a probability distribution (Sarabia et al., 2012). It was also expected that
it might fit the baby name popularity data in the U.S., to be analyzed in this paper, which has been well
documented by the U.S. Social Security Administration since 1880. If the numbers of babies
with the same first name (born in the same year and with the given gender) are ranked by
popularity ($\{y_r\}$ for rank $r = 1, 2, \ldots$), will the relationship between $y_r$ and $r$ be a
power-law function (and thus another example of Zipf's law)? Or will it be a Beta rank function?
Figure 1: Top: The most popular boy (blue) and girl (pink) names as the percentage of total births; Bottom:
The number of samples included in the datasets (top 1000 and +5 occurrences) as the percentage of total births.

2 The temporal trend of baby name popularity data

The baby name popularity data in the U.S. since 1880 is available from the U.S. Social Security
Administration (SSA) site: http://www.ssa.gov/oact/babynames/. Boys' and girls' names
are listed separately. Two versions of the data can be downloaded from the website: the web
form version for the top 1000 (gender-specific) names, and the flat file for all names with 5 or
more occurrences. Baby names with fewer than 5 occurrences are not given to the public to
"safeguard privacy". Since neither dataset is 100% complete, we check how much of all births
is included in the data by using the information on the top-1 baby name. The most popular
baby name as a percentage of all births is plotted in Fig. 1 (top). The number of all births
(not directly provided by the SSA site) can be derived as $n_{\mathrm{total}} = n_{r=1}/p_{r=1}$,
where $n_{r=1}$ is the number of babies given the top name and $p_{r=1}$ is that name's share of
all births. Fig. 1 (bottom) shows the percentage of births included in the two datasets (top 1000
and +5 occurrences):

$$p_{\mathrm{top1000}} = \sum_{r=1}^{1000} n_r \Big/ n_{\mathrm{total}}, \qquad p_{+5} = \sum_{r:\, n_r \ge 5} n_r \Big/ n_{\mathrm{total}}$$
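
The coverage calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `coverage`, its arguments, and the toy numbers are hypothetical and are not taken from the SSA files; the total number of births is recovered from the top name's count and its share of all births, and the two coverage fractions follow from it.

```python
def coverage(counts, top_fraction, top_k=1000, min_count=5):
    """Estimate total births and dataset coverage for one year and one gender.

    counts       -- number of babies per name (hypothetical input, any order)
    top_fraction -- share of all births carried by the most popular name
    """
    ranked = sorted(counts, reverse=True)
    # n_total = n_{r=1} / p_{r=1}
    n_total = ranked[0] / top_fraction
    # fraction of births covered by the top_k names
    p_top = sum(ranked[:top_k]) / n_total
    # fraction of births covered by names with at least min_count occurrences
    p_plus = sum(n for n in ranked if n >= min_count) / n_total
    return n_total, p_top, p_plus


# Toy numbers only (not SSA data): five names, top name = 25% of all births.
n_total, p_top3, p_plus5 = coverage([500, 300, 120, 60, 20], 0.25, top_k=3)
```

Because the published flat file already omits names with fewer than 5 occurrences, the `min_count` filter is a no-op on that file; it is kept here only to mirror the definition.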

The top baby name was more dominant in early years than in recent years (the top boy 
names were John (1880-1923), Robert (1924-1939, 1953), James (1940-1952), Michael (1954- 
1959, 1961-1998), David (1960), Jacob (1999-2010), and the top girl names were Mary (1880- 
1946, 1953-1961), Linda (1947-1952), Lisa (1962-1969), Jennifer (1970-1984), Jessica (1985- 
1990, 1993-1995), Ashley (1991-1992), Emily (1996-2007), Emma (2008), Isabella (2009-2010)). 
However, to characterize the uneven naming of babies, using the statistics from the top name 
alone is not enough. 

In economics, how the rich dominate over the poor in the wealth distribution can be measured
by the Gini index: if the rich have all the wealth, the Gini index is 1; if there is no
economic inequality, the Gini index is 0 (see, e.g., Cowell, 2000). A formula for the Gini index from
the ranked data $\{y_r\}$ is

$$G = R^{-1} \left( R + 1 - 2\, \frac{\sum_{r=1}^{R} r\, y_r}{\sum_{r=1}^{R} y_r} \right)$$

where $R = \max(r)$ is the maximum rank (number of entities). The index is implemented in, e.g., the
R package reldist (Handcock and Morris, 1999).
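
As an illustration of the ranked-data Gini formula above, here is a minimal Python sketch. It is not the reldist implementation; the function name `gini_from_ranked` is hypothetical, and rank 1 is assumed to be the largest value.

```python
def gini_from_ranked(y):
    """Gini index G = (R + 1 - 2 * sum(r * y_r) / sum(y_r)) / R,
    with rank r = 1 assigned to the largest value."""
    ranked = sorted(y, reverse=True)
    R = len(ranked)
    total = sum(ranked)
    weighted = sum(r * value for r, value in enumerate(ranked, start=1))
    return (R + 1 - 2 * weighted / total) / R


equal = gini_from_ranked([5, 5, 5, 5])          # no inequality
concentrated = gini_from_ranked([10, 0, 0, 0])  # one entity holds everything
```

For a finite number of entities the fully concentrated case gives $(R-1)/R$ rather than exactly 1, so the upper limit of 1 is approached only as $R$ grows.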

Fig. 2 shows the Gini index calculated from both the top 1000 data and the +5 occurrence data
for the last 131 years. There are three observations. First of all, using only the top 1000
names will underestimate the Gini index, and the degree of underestimation is more severe
with larger sample sizes. This is because the baby names not included in the data contribute
more to the inequality of name popularity. With more babies born in recent years, the top 1000
names cover a smaller share of all births (Fig. 1, bottom).

Figure 2: Gini index calculated for two datasets (top 1000 and +5 occurrences) for boys (blue) and girls
(pink) separately.

Secondly, there is, generally speaking, more diversity (inequality) in girl names than
in boy names. A similar conclusion was reached in (Barry and Harper, 1995) using regional
data from the state of Pennsylvania, comparing baby names in only two years, 1960 and
1990. This can partially be seen from Fig. 1, as for most years the top girl names tend to take
a smaller percentage of all girl births than the corresponding top boy names.

Thirdly, there is a weak trend of lower Gini index in recent (e.g. 30) years, i.e., more diverse
name giving. A similar conclusion was also reached in (Twenge et al., 2010). This can be
caused by many factors, such as more diverse immigrant groups in recent years. It is not clear
how the increase in the total number of names may contribute to this trend.



3 Fitting ranked baby name popularity 



To find a better representation of the ranked distribution of baby name popularity, we show
4 different versions of the ranked distribution for the (+5 occurrence) data in years 1910, 1960,
and 2010 in Fig. 3. In principle, there are 16 = 4×4 ways of plotting: 4 choices of axis scale,
linear($x$)-linear($y$), linear-log, log-linear, and log-log, combined with 4 choices of normalization,
original($x$)-original($y$), normalized-original, original-normalized, and normalized-normalized.
By normalizing, we mean dividing by the maximum value. We discard the linear-linear and linear-log
versions (because high-ranking names are not highlighted enough in these versions), and we also
discard the two versions with a normalized rank axis (because the total number of baby names is
not available). Of the remaining 4 versions in Fig. 3, the log-log version seems to be the better
representation, as rank distributions from different years are closer to each other. The ranked
normalized baby name popularities in log-log scale for every year from 1880 to 2010 are shown in
Fig. 4 (lines are shifted downward to improve visibility).

Figure 3: Four versions of the ranked baby name popularity data for the years 1910, 1960, and 2010 (boys
(blue) and girls (pink) are ranked separately): (upper left) log-log; (upper right) log-linear; (lower left) log-log
where $y$ is normalized by the number of the most popular baby name (i.e., the rank-1 name has $y$ value equal
to 1); (lower right) log-linear where $y$ is normalized.

Each year's ranked normalized baby name popularity data in Fig. 4 is fitted by three functions,
a power-law function, a Beta rank function, and a two-piece function:

$$
\begin{aligned}
\text{power-law:} \quad & \log(y_r/y_{r=1}) = C - a\log(r) \\
\text{Beta:} \quad & \log(y_r/y_{r=1}) = C - a\log(r) + b\log(R+1-r) \\
\text{two-piece:} \quad & \log(y_r/y_{r=1}) =
\begin{cases}
C - a\log(r) + b\log(R+1-r) & r \le r_0 \\
C - a\log(r) & r > r_0
\end{cases}
\end{aligned}
\qquad (1)
$$

The first is the power-law function (Zipf's law). The second is the Beta function (Martínez-Mekler
et al., 2009), where $R = \max(r)$ is the maximum rank in the +5 occurrence data (the total number
of all baby names, including those with fewer than 5 occurrences, is unavailable). The last
function is a two-piece function with a Beta function for the high-ranking regime and a power-law
function for the low-ranking regime; its parameters are fitted separately in each regime (see Fig. 6).
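
Because the power-law and Beta rank functions are linear in their parameters once both sides are written in logarithms, one simple way to fit them is ordinary least squares in log space. The sketch below illustrates that idea only; it is an assumption, not necessarily the fitting procedure used for the results reported here. The function names are hypothetical, and natural logarithms are assumed (the exponents $a$ and $b$ do not depend on the base, but $C$ and the SSE do).

```python
import numpy as np


def normalized_log(counts):
    """Rank counts in decreasing order and return log(y_r / y_{r=1})."""
    y = np.sort(np.asarray(counts, dtype=float))[::-1]
    return np.log(y / y[0])


def _least_squares(X, logy):
    """Ordinary least squares; returns the coefficients and the SSE."""
    coef, *_ = np.linalg.lstsq(X, logy, rcond=None)
    sse = float(np.sum((logy - X @ coef) ** 2))
    return coef, sse


def fit_power_law(logy):
    """Fit log(y_r / y_1) = C - a log(r); returns (C, a) and the SSE."""
    r = np.arange(1, len(logy) + 1)
    X = np.column_stack([np.ones(len(r)), -np.log(r)])
    return _least_squares(X, logy)


def fit_beta(logy):
    """Fit log(y_r / y_1) = C - a log(r) + b log(R + 1 - r);
    returns (C, a, b) and the SSE, with R taken as the number of points."""
    R = len(logy)
    r = np.arange(1, R + 1)
    X = np.column_stack([np.ones(R), -np.log(r), np.log(R + 1 - r)])
    return _least_squares(X, logy)


# Synthetic counts following an exact power law with a = 1.2 (not SSA data).
ranks = np.arange(1, 201, dtype=float)
(C_hat, a_hat), sse_power = fit_power_law(normalized_log(1000.0 * ranks ** -1.2))
```

Comparing the SSE returned by the two fits for the same year gives the kind of per-year comparison discussed in the text.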

The two-piece function is motivated by inspecting the ranked plots in Fig. 4: the low-ranking
points seem to follow a straight line, whereas the high-ranking points are curved in
the log-log plot. This choice is also motivated by the fact that the fitting of the Beta function
in the linear-linear scale by non-linear least-squares regression (Bates and Watts, 1988;
Li and Miramontes, 2011; Li, 2012a) does not converge for these datasets (results not shown),
indicating that the Beta function does not fit the data well in the full rank range.

The segmentation point $r_0$ is not known beforehand. We determine its value by minimizing
the overall sum-of-squared-error (SSE):

$$
\begin{aligned}
SSE(r') &= \sum_{r=1}^{r'} \left( \log\frac{y_r}{\max(y)} - C + a\log(r) - b\log(R+1-r) \right)^2
 + \sum_{r=r'+1}^{R} \left( \log\frac{y_r}{\max(y)} - C + a\log(r) \right)^2 \\
r_0 &= \arg\min_{r'} SSE(r')
\end{aligned}
\qquad (2)
$$

To save computing time, we limit the range of choice of $r_0$ to between $r = 20$ and $r = 200$.
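
The search for the segmentation point can be sketched as a brute-force scan over candidate values, refitting both regimes for each candidate and keeping the one with the smallest total SSE. The code below is a minimal illustration under assumptions that the text does not settle: both regimes are fitted independently by least squares in log space, no continuity is imposed at the segmentation point, natural logarithms are used, and $R$ in the Beta regime is the full maximum rank. The function names are hypothetical.

```python
import numpy as np


def _sse(X, y):
    """Sum of squared residuals of an ordinary least-squares fit."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ coef) ** 2))


def find_segmentation_point(counts, r_min=20, r_max=200):
    """Scan candidate segmentation points and return (r0, SSE(r0)).

    Ranks 1..r0 are fitted by the Beta rank function and ranks r0+1..R
    by the power-law function, each with its own parameters.
    """
    y = np.sort(np.asarray(counts, dtype=float))[::-1]
    R = len(y)
    r = np.arange(1, R + 1)
    logy = np.log(y / y[0])
    X_beta = np.column_stack([np.ones(R), -np.log(r), np.log(R + 1 - r)])
    X_power = np.column_stack([np.ones(R), -np.log(r)])

    best = None
    # keep at least two points in the power-law regime
    for r0 in range(r_min, min(r_max, R - 2) + 1):
        total = _sse(X_beta[:r0], logy[:r0]) + _sse(X_power[r0:], logy[r0:])
        if best is None or total < best[1]:
            best = (r0, total)
    return best
```

A selected value equal to `r_min` or `r_max` signals that the constraint is binding, which is the situation flagged by the boundary markers in Fig. 5.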

Fig. 5 and Fig. 6 show all fitting results by these three functions for the baby name popularity
data of the last 131 years. The top plot is the SSE per point (i.e., the SSE in Eq. (2) divided by
the maximum rank) for the power-law function, Beta function, and two-piece function. The
data for boys is in blue, that for girls in pink. The Beta function reduces the SSE as
compared to the power-law function, but the reduction is not large. On the other hand, the
two-piece function fits the data much better. For the two-piece function, if $r_0$ is chosen at
the boundary of our selected range (e.g. $r_0 = 200$), it indicates that the SSE may be reduced even
further if we remove the constraint. These datasets are marked by red/darkblue dots in Fig. 5.

Figure 4: Ranked normalized number of baby names (the rank-1 name, or the most popular name, has $y$ value
equal to 1) for each year from 1880 to 2010 (131 lines), plotted against rank. Boys (blue) and girls (pink) names
are ranked separately. Plots from different years are shifted downward to improve visibility.

The second plot in Fig. 5 is the relative SSE: the SSE for the Beta function, and the SSE for the
two-piece function, divided by that for the power-law function. The ratio is only slightly less than 1 for
the Beta function, but is as low as 0.2-0.3 for the two-piece function.

The third plot in Fig. 5 is the relative position of the segmentation point in the logarithmic
scale: $\log(r_0)/\log(R)$. If this value is 0.5, then the Beta and power-law functions each cover half of the
log-rank range in the two-piece function.

Fig. 6 shows the fitted parameter values for $a$ in the power-law function, in the Beta function, and
in both parts of the two-piece function. The value of $a$ is mostly between 1 and 2, but for the
Beta function in the two-piece situation it is mostly less than 1. The $b$ value in the Beta function
fluctuates around 0, but for the Beta function in the two-piece situation it becomes much larger
(median value is 0.14-0.16).

All these results point to the conclusion that our two-piece function, with a Beta function
for the high-ranking data and a power-law function for the low-ranking data, fits the data much
better than either the power-law function or the Beta function by itself. Graphically speaking,
this is another way to state that in a log-log plot, the ranked baby name popularity in a typical
year curves down and is then followed by a straight line.



Figure 5: Results for boy names are marked by blue, and those for girl names by pink, plotted against year
(1880-2010). (1) Sum of squared errors (SSE) (Eq. (2)) divided by the maximum rank ($R$) for the power-law,
Beta, and two-piece functions in Eq. (1); (2) SSE of the Beta and two-piece functions normalized by the SSE of
the power-law function; (3) relative position of the point which separates the Beta and power-law regimes in
the two-piece function, in logarithmic scale: $\log(r_0)/\log(R)$.

Figure 6: Results for boy names are marked by blue or darkblue, and those for girl names by pink or red.
See Eq. (1) for definitions. Shown are the $a$ values for the power-law function, the Beta function, the
power-law regime of the two-piece function, and the Beta regime of the two-piece function, and the $b$ values
for the Beta function and the Beta regime of the two-piece function.

4 Discussion 



The universality of the power-law distribution in nature as well as in human-related data is heatedly
debated (Clauset et al., 2009). To require a power-law to be true for all ranges is clearly
a very strong constraint, and not many datasets can pass the test. The Beta rank function
(Martínez-Mekler et al., 2009) was introduced to relax this strong requirement and increase
the number of datasets that can be well fitted. The main contribution of this paper is to show that even
the Beta rank function may fail to fit some datasets well, such as the baby name popularity data.

Our choice of using a Beta function for high-ranking names and a power-law function for
low-ranking names is empirically based on examining the raw data in log-log plots, but it has
interesting parallels in other fields. It has been observed in linguistic data that the Beta rank
function tends to fit the data very well when the maximum rank is limited, for example, for
the ranked letter frequency distribution (Li and Miramontes, 2011). On the other hand, this
may not be the case when there is no limit on the maximum rank, such as for the word frequency
distribution, where the mechanism for data generation is described by the "large number of
rare events" model (Baayen, 2001).



We can imagine a situation in which the quantitative law governing the distribution of popular names
(fewer and limited) is distinct from that for rare names (numerous and unlimited).
A similar idea of separating high-ranking and low-ranking events in another application
(polymorphous Chinese syllables versus regular syllables) was discussed in (Li, 2012b).

As in the case of any empirical fitting of data, other empirical alternatives are possible. In
particular, with only 100 or fewer points to fit (the median size of the first regime in the two-piece
function), many functions other than the Beta rank function may also be used. If the heterogeneity
between high- and low-ranking names does indeed exist, then it is a challenge to find a single
functional form which could fit the ranked baby name popularity data in its whole rank range.